Abstract

Most online algorithms used in machine learning today are based on variants
of mirror descent or follow-the-leader. In this paper, we present an
online algorithm based on a completely different approach, which combines
"random playout" and randomized rounding of loss subgradients. As an
application of our approach, we provide the first computationally efficient
online algorithm for collaborative filtering with trace-norm constrained
matrices. As a second application, we solve an open question linking batch
learning and transductive online learning.

1 Introduction 

Online learning algorithms, which have received much attention in recent years, enjoy an 
attractive combination of computational efficiency, lack of distributional assumptions, and 
strong theoretical guarantees. However, it is probably fair to say that at their core, most of 
these algorithms are based on the same small set of fundamental techniques, in particular 
mirror descent and regularized follow-the-leader (see for instance [2]). 

In this work we revisit, and significantly extend, an algorithm which uses a completely
different approach. This algorithm, known as the Minimax Forecaster, was introduced
in [5, 11] for the setting of prediction with static experts. It computes minimax predictions
in the case of known horizon, binary outcomes, and absolute loss. Although the original
version is computationally expensive, it can easily be made efficient through randomization.

We extend the analysis of [5] to the case of non-binary outcomes and arbitrary convex and
Lipschitz loss functions. The new algorithm is based on a combination of "random playout"
and randomized rounding, which assigns random binary labels to future unseen instances,
in a way depending on the loss subgradients. Our resulting Randomized Rounding ($R^2$)
Forecaster has a parameter trading off regret performance and computational complexity,
and runs in polynomial time (for $T$ predictions, it requires computing $O(T^2)$ empirical risk
minimizers in general, as opposed to $O(T)$ for generic follow-the-leader algorithms). The
regret of the $R^2$ Forecaster is determined by the Rademacher complexity of the comparison
class. The connection between online learnability and Rademacher complexity has also been 
explored in [HH]. However, these works focus on the information-theoretically achievable 
regret, as opposed to computationally efficient algorithms. The idea of "random playout",
in the context of online learning, has also been used in [16, 3], but we apply this idea in a
different way.

We show that the $R^2$ Forecaster can be used to design the first efficient online learning
algorithm for collaborative filtering with trace-norm constrained matrices. While this is a
well-known setting, a straightforward application of standard online learning approaches,
such as mirror descent, appears to give only trivial performance guarantees. Moreover, our
regret bound matches the best currently known sample complexity bound in the batch
distribution-free setting [21].

As a different application, we consider the relationship between batch learning and
transductive online learning. This relationship was analyzed in [16], in the context of binary
prediction with respect to classes of bounded VC dimension. Their main result was that
efficient learning in a statistical setting implies efficient learning in the transductive online
setting, but at an inferior rate of $T^{3/4}$ (where $T$ is the number of rounds). The main open
question posed by that paper is whether a better rate can be obtained. Using the $R^2$
Forecaster, we improve on those results, and provide an efficient algorithm with the optimal
$\sqrt{T}$ rate, for a wide class of losses. This shows that efficient batch learning not only implies
efficient transductive online learning (the main thesis of [16]), but also that the same rates
can be obtained, and for possibly non-binary prediction problems as well.

We emphasize that the $R^2$ Forecaster requires computing many empirical risk minimizers
(ERMs) at each round, which might be prohibitive in practice. Thus, while it does run
in polynomial time whenever an ERM can be efficiently computed, we make no claim that
it is a "fully practical" algorithm. Nevertheless, it seems to be a useful tool in showing
that efficient online learnability is possible in various settings, often working in cases where
more standard techniques appear to fail. Moreover, we hope the techniques we employ
might prove useful in deriving practical online algorithms in other contexts.

2 The Minimax Forecaster 

We start by introducing the sequential game of prediction with expert advice (see [10]).
The game is played between a forecaster and an adversary, and is specified by an outcome
space $\mathcal{Y}$, a prediction space $\mathcal{P}$, a nonnegative loss function
$\ell : \mathcal{P} \times \mathcal{Y} \to \mathbb{R}$, which measures
the discrepancy between the forecaster's prediction and the outcome, and an expert class
$\mathcal{F}$. Here we focus on classes $\mathcal{F}$ of static experts, whose prediction at each round $t$ does
not depend on the outcome in previous rounds. Therefore, we think of each $f \in \mathcal{F}$ simply
as a sequence $\mathbf{f} = (f_1, f_2, \ldots)$ where each $f_t \in \mathcal{P}$. At each step $t = 1, 2, \ldots$ of the game,
the forecaster outputs a prediction $p_t \in \mathcal{P}$ and simultaneously the adversary reveals an
outcome $y_t \in \mathcal{Y}$. The forecaster's goal is to predict the outcome sequence almost as well as
the best expert in the class $\mathcal{F}$, irrespective of the outcome sequence $\mathbf{y} = (y_1, y_2, \ldots)$. The
performance of a forecasting strategy $A$ is measured by the worst-case regret

$$\mathcal{V}_T(A, \mathcal{F}) = \sup_{\mathbf{y} \in \mathcal{Y}^T} \left( \sum_{t=1}^{T} \ell(p_t, y_t) - \inf_{\mathbf{f} \in \mathcal{F}} \sum_{t=1}^{T} \ell(f_t, y_t) \right) \tag{1}$$

viewed as a function of the horizon $T$. To simplify notation, let
$L(\mathbf{f}, \mathbf{y}) = \sum_{t=1}^{T} \ell(f_t, y_t)$.

Consider now the special case where the horizon $T$ is fixed and known in advance, the
outcome space is $\mathcal{Y} = \{-1,+1\}$, the prediction space is $\mathcal{P} = [-1,+1]$, and the loss is the
absolute loss $\ell(p, y) = |p - y|$. We will denote the regret in this special case as
$\mathcal{V}_T^{\mathrm{abs}}(A, \mathcal{F})$.

The Minimax Forecaster, which is based on work presented in [5] and [11] (see also [10]
for an exposition), is derived by an explicit analysis of the minimax regret
$\inf_A \mathcal{V}_T^{\mathrm{abs}}(A, \mathcal{F})$,
where the infimum is over all forecasters $A$ producing at round $t$ a prediction $p_t$ as a
function of $p_1, y_1, \ldots, p_{t-1}, y_{t-1}$. For general online learning problems, the analysis of this
quantity is intractable. However, for the specific setting we focus on (absolute loss and binary
outcomes), one can get both an explicit expression for the minimax regret, as well as an
explicit algorithm, provided $\inf_{\mathbf{f} \in \mathcal{F}} \sum_{t=1}^{T} \ell(f_t, y_t)$ can be efficiently computed for any
sequence $y_1, \ldots, y_T$. This procedure is akin to performing empirical risk minimization (ERM)
in statistical learning. A full development of the analysis is out of scope, but is outlined in
Appendix A. In a nutshell, the idea is to begin by calculating the optimal prediction in the
last round $T$, and then work backwards, calculating the optimal prediction at round $T-1$,
$T-2$, etc. Remarkably, the value of $\inf_A \mathcal{V}_T^{\mathrm{abs}}(A, \mathcal{F})$ is exactly the Rademacher complexity
$\mathcal{R}_T(\mathcal{F})$ of the class $\mathcal{F}$, which is known to play a crucial role in understanding the sample
complexity in statistical learning [5]. In this paper, we define it as follows (see footnote 1):

$$\mathcal{R}_T(\mathcal{F}) = \mathbb{E}\left[ \sup_{\mathbf{f} \in \mathcal{F}} \sum_{t=1}^{T} \sigma_t f_t \right] \tag{2}$$

where $\sigma_1, \ldots, \sigma_T$ are i.i.d. Rademacher random variables, taking values $-1$, $+1$ with equal
probability. When $\mathcal{R}_T(\mathcal{F}) = o(T)$, we get a minimax regret
$\inf_A \mathcal{V}_T^{\mathrm{abs}}(A, \mathcal{F}) = o(T)$, which
implies a vanishing per-round regret.
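To make the definition in Eq. (2) concrete, the following is a minimal sketch that evaluates the (unnormalized) Rademacher complexity of a finite class of static experts, both exactly by enumerating all sign vectors and approximately by sampling. The function names and the representation of the class as an `(N, T)` NumPy array are assumptions made for illustration only.

```python
import itertools
import numpy as np

def rademacher_exact(experts):
    """Exact E[max_f sum_t sigma_t f_t], enumerating all 2^T sign vectors."""
    experts = np.asarray(experts, dtype=float)
    T = experts.shape[1]
    signs = np.array(list(itertools.product((-1.0, 1.0), repeat=T)))  # shape (2^T, T)
    return float((signs @ experts.T).max(axis=1).mean())

def rademacher_monte_carlo(experts, n_samples=20000, seed=0):
    """Monte Carlo estimate of the same quantity, usable for larger T."""
    experts = np.asarray(experts, dtype=float)
    rng = np.random.default_rng(seed)
    signs = rng.choice((-1.0, 1.0), size=(n_samples, experts.shape[1]))
    return float((signs @ experts.T).max(axis=1).mean())

# Toy class with two constant experts and horizon T = 2.
toy_class = np.array([[1.0, 1.0], [-1.0, -1.0]])
exact_value = rademacher_exact(toy_class)
estimate = rademacher_monte_carlo(toy_class)
```

For the toy class the supremum equals $|\sigma_1 + \sigma_2|$, so the exact value is 1; the sampled estimate should be close to it.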

In terms of an explicit algorithm, the optimal prediction $p_t$ at round $t$ is given by a
complicated-looking recursive expression, involving exponentially many terms. Indeed, for
general online learning problems, this is the most one seems able to hope for. However, an
apparently little-known fact is that when one deals with a class $\mathcal{F}$ of fixed binary sequences
as discussed above, then one can write the optimal prediction $p_t$ in a much simpler way.
Letting $Y_1, \ldots, Y_T$ be i.i.d. Rademacher random variables, the optimal prediction at round
$t$ can be written as

$$p_t = \frac{1}{2}\, \mathbb{E}\left[ \inf_{\mathbf{f} \in \mathcal{F}} L\big(\mathbf{f},\, y_1 \cdots y_{t-1}\, (-1)\, Y_{t+1} \cdots Y_T\big) - \inf_{\mathbf{f} \in \mathcal{F}} L\big(\mathbf{f},\, y_1 \cdots y_{t-1}\, 1\, Y_{t+1} \cdots Y_T\big) \right]. \tag{3}$$

In words, the prediction is simply (half) the expected difference between the minimal cumulative
loss over $\mathcal{F}$, when the adversary plays $-1$ at round $t$ and random values afterwards, and
the minimal cumulative loss over $\mathcal{F}$, when the adversary plays $+1$ at round $t$, and the same
random values afterwards. The factor $1/2$ keeps $p_t$ in $[-1,+1]$, since changing a single outcome
changes the minimal cumulative absolute loss by at most 2. Again, we refer the reader to
Appendix A for how this is derived.
We denote this optimal strategy (for absolute loss and binary outcomes) as the Minimax
Forecaster (MF):



Algorithm 1 Minimax Forecaster (MF)

for t = 1 to T do
    Predict $p_t$ as defined in Eq. (3)
    Receive outcome $y_t$ and suffer loss $|p_t - y_t|$
end for



The relevant guarantee for MF is summarized in the following theorem.

Theorem 1. For any class $\mathcal{F} \subseteq [-1,+1]^T$ of static experts, the regret of the Minimax
Forecaster (Algorithm 1) satisfies $\mathcal{V}_T^{\mathrm{abs}}(\mathrm{MF}, \mathcal{F}) = \mathcal{R}_T(\mathcal{F})$.
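The statement of Theorem 1 can be checked by brute force on a tiny instance. The sketch below is an illustration under stated assumptions, not the authors' implementation: the class is a small finite set of experts stored as an `(N, T)` array, the expectation over future Rademacher values is computed exactly by enumeration, and the prediction is taken as one half of the expected difference of the two ERM values so that it lies in $[-1,+1]$. All function names are hypothetical.

```python
import itertools
import numpy as np

def erm_value(experts, y):
    """inf_f L(f, y) under absolute loss for a finite class given as an (N, T) array."""
    return float(np.abs(experts - np.asarray(y, dtype=float)).sum(axis=1).min())

def mf_prediction(experts, past):
    """Exact Minimax Forecaster prediction for the round that follows `past`."""
    T = experts.shape[1]
    total, count = 0.0, 0
    for future in itertools.product((-1.0, 1.0), repeat=T - len(past) - 1):
        minus = erm_value(experts, (*past, -1.0, *future))  # adversary plays -1 now
        plus = erm_value(experts, (*past, 1.0, *future))    # adversary plays +1 now
        total += minus - plus
        count += 1
    return 0.5 * total / count

def mf_regret(experts, y):
    """Regret of the exact forecaster on the outcome sequence y."""
    loss = sum(abs(mf_prediction(experts, y[:t]) - y[t]) for t in range(len(y)))
    return loss - erm_value(experts, y)

def rademacher_exact(experts):
    T = experts.shape[1]
    signs = np.array(list(itertools.product((-1.0, 1.0), repeat=T)))
    return float((signs @ experts.T).max(axis=1).mean())

rng = np.random.default_rng(0)
experts = rng.uniform(-1.0, 1.0, size=(3, 4))  # 3 static experts, horizon T = 4
regrets = [mf_regret(experts, y) for y in itertools.product((-1.0, 1.0), repeat=4)]
complexity = rademacher_exact(experts)
```

On such small instances the regret comes out the same for every outcome sequence and coincides with the Rademacher complexity, which is what one expects from a strategy that equalizes the adversary's two options at every round.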

2.1 Making the Minimax Forecaster Efficient 

The Minimax Forecaster described above is not computationally efficient, as the computation
of $p_t$ requires averaging over exponentially many ERMs. However, by a martingale
argument, it is not hard to show that it is in fact sufficient to compute only two ERMs per
round.



Algorithm 2 Minimax Forecaster with efficient implementation (MF*)

for t = 1 to T do
    For i = t+1, ..., T, let $Y_i$ be a Rademacher random variable
    Let $p_t := \frac{1}{2}\left( \inf_{\mathbf{f} \in \mathcal{F}} L(\mathbf{f},\, y_1 \cdots y_{t-1}\, (-1)\, Y_{t+1} \cdots Y_T) - \inf_{\mathbf{f} \in \mathcal{F}} L(\mathbf{f},\, y_1 \cdots y_{t-1}\, 1\, Y_{t+1} \cdots Y_T) \right)$
    Predict $p_t$, receive outcome $y_t$ and suffer loss $|p_t - y_t|$
end for



Theorem 2. For any class $\mathcal{F} \subseteq [-1,+1]^T$ of static experts, the regret of the randomized
forecasting strategy MF* (Algorithm 2) satisfies

$$\mathcal{V}_T^{\mathrm{abs}}(\mathrm{MF}^*, \mathcal{F}) \le \mathcal{R}_T(\mathcal{F}) + \sqrt{2T \ln(1/\delta)}$$

with probability at least $1 - \delta$. Moreover, if the predictions $\mathbf{p} = (p_1, \ldots, p_T)$ are computed
reusing the random values $Y_1, \ldots, Y_T$ computed at the first iteration of the algorithm, rather
than drawing fresh values at each iteration, then it holds that

$$\mathbb{E}\left[ L(\mathbf{p}, \mathbf{y}) - \inf_{\mathbf{f} \in \mathcal{F}} L(\mathbf{f}, \mathbf{y}) \right] \le \mathcal{R}_T(\mathcal{F}) \quad \text{for all } \mathbf{y} \in \{-1,+1\}^T.$$

1 In the statistical learning literature, it is more common to scale this quantity by 1/T, but the
form we use here is more convenient for stating cumulative regret bounds.

Proof sketch. To prove the second statement, note that $|\mathbb{E}[p_t] - y_t| = \mathbb{E}[|p_t - y_t|]$ for any fixed
$y_t \in \{-1,+1\}$ and $p_t$ bounded in $[-1,+1]$, and use Thm. 1. To prove the first statement,
note that $|p_t - y_t| - |\mathbb{E}_{p_t}[p_t] - y_t|$ for $t = 1, \ldots, T$ is a martingale difference sequence with
respect to $p_1, \ldots, p_T$, and apply Azuma's inequality. □

The second statement in the theorem bounds the regret only in expectation and is thus
weaker than the first one. On the other hand, it might have algorithmic benefits. Indeed, if
we reuse the same values for $Y_1, \ldots, Y_T$, then the computation of the infima over $\mathbf{f}$ in MF*
is with respect to an outcome sequence which changes only at one point in each round.
Depending on the specific learning problem, it might be easier to re-compute the infimum
after changing a single point in the outcome sequence, as opposed to computing the infimum
over a different outcome sequence in each round.
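The two variants discussed in this subsection (fresh random values at every round versus reusing one random playout) can be sketched as follows. This is an illustrative sketch only: the finite-class ERM oracle `erm_value`, the function name `mf_star`, and the `reuse_playout` flag are assumptions, and the prediction uses one half of the difference of the two ERM values so that it stays in $[-1,+1]$.

```python
import numpy as np

def erm_value(experts, y):
    """inf_f L(f, y) under absolute loss; stands in for a problem-specific ERM oracle."""
    return float(np.abs(experts - y).sum(axis=1).min())

def mf_star(experts, outcomes, rng, reuse_playout=False):
    """Randomized Minimax Forecaster: two ERM computations per round."""
    T = experts.shape[1]
    playout = rng.choice((-1.0, 1.0), size=T)  # used as-is when reuse_playout=True
    predictions = np.empty(T)
    for t in range(T):
        if not reuse_playout:
            playout = rng.choice((-1.0, 1.0), size=T)  # fresh values for rounds t+1..T
        # Past outcomes, a hypothetical outcome at round t, then the random playout.
        y_minus = np.concatenate([outcomes[:t], [-1.0], playout[t + 1:]])
        y_plus = np.concatenate([outcomes[:t], [1.0], playout[t + 1:]])
        predictions[t] = 0.5 * (erm_value(experts, y_minus) - erm_value(experts, y_plus))
    return predictions

rng = np.random.default_rng(1)
single_expert = np.array([[0.3, -0.7, 1.0, 0.0]])
outcomes = np.array([1.0, -1.0, -1.0, 1.0])
predictions = mf_star(single_expert, outcomes, rng)
```

With a single expert the two ERM values differ by exactly $2 f_t$, so the forecaster reproduces the expert's predictions regardless of the random playout. With `reuse_playout=True`, consecutive ERM calls differ in only a few coordinates of the outcome sequence, which is the situation where warm-starting an ERM solver may help.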



3 The $R^2$ Forecaster



The Minimax Forecaster presented above is very specific to the absolute loss $\ell(f, y) =
|f - y|$ and to binary outcomes $\mathcal{Y} = \{-1,+1\}$, which limits its applicability. We note that
extending the forecaster to other losses or different outcome spaces is not trivial: indeed,
the recursive unwinding of the minimax regret term, leading to an explicit expression and
an explicit algorithm, does not work as-is for other cases. Nevertheless, we will now show
how one can deal with general (convex, Lipschitz) loss functions and outcomes belonging to
any real interval $[-b, b]$.

The algorithm we propose essentially uses the Minimax Forecaster as a subroutine, by
feeding it with a carefully chosen sequence of binary values $z_t$, and using predictions $\tilde{f}_t$
which are scaled to lie in the interval $[-1,+1]$. The values of $z_t$ are based on a randomized
rounding of values in $[-1,+1]$, which depend in turn on the loss subgradient. Thus, we
denote the algorithm as the Randomized Rounding ($R^2$) Forecaster.

To describe the algorithm, we introduce some notation. For any scalar $f \in [-b, b]$, define
$\tilde{f} = f/b$ to be the scaled version of $f$ into the range $[-1,+1]$. For vectors $\mathbf{f}$, define
$\tilde{\mathbf{f}} = (1/b)\mathbf{f}$. Also, we let $\partial_{p_t} \ell(p_t, y_t)$ denote any subgradient of the loss function $\ell$ with respect
to the prediction $p_t$. The pseudocode of the $R^2$ Forecaster is presented as Algorithm 3 below,
and its regret guarantee is summarized in Thm. 3. The proof is presented in Appendix B.

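Theorem 3. Suppose $\ell$ is convex and $\rho$-Lipschitz in its first argument. For any
$\mathcal{F} \subseteq [-b, b]^T$, the regret of the $R^2$ Forecaster (Algorithm 3) satisfies

$$\mathcal{V}_T(R^2, \mathcal{F}) \le \rho\, \mathcal{R}_T(\mathcal{F}) + \rho\, b \left( \sqrt{\frac{1}{\eta}} + 2 \right) \sqrt{2T \ln\left( \frac{2T}{\delta} \right)} \tag{4}$$

with probability at least $1 - \delta$.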

The prediction $p_t$ which the algorithm computes is an empirical approximation to

$$\frac{b}{2}\, \mathbb{E}_{Y_{t+1}, \ldots, Y_T}\left[ \inf_{\mathbf{f} \in \mathcal{F}} L\big(\tilde{\mathbf{f}},\, z_1 \cdots z_{t-1}\, (-1)\, Y_{t+1} \cdots Y_T\big) - \inf_{\mathbf{f} \in \mathcal{F}} L\big(\tilde{\mathbf{f}},\, z_1 \cdots z_{t-1}\, 1\, Y_{t+1} \cdots Y_T\big) \right],$$

obtained by repeatedly drawing independent values for $Y_{t+1}, \ldots, Y_T$ and averaging (the factor
$b/2$ ensures $p_t \in [-b, b]$, since each difference of infima lies in $[-2,+2]$). The accuracy of
the approximation is reflected in the precision parameter $\eta$. A larger value of $\eta$ improves the
regret bound, but also increases the runtime of the algorithm. Thus, $\eta$ provides a trade-off
between the computational complexity of the algorithm and its regret guarantee. We note
that even when $\eta$ is taken to be a constant fraction, the resulting algorithm still runs in
polynomial time $O(T^2 c)$, where $c$ is the time to compute a single ERM. In subsequent results
pertaining to this Forecaster, we will assume that $\eta$ is taken to be a constant fraction.

Algorithm 3 The $R^2$ Forecaster

Input: Upper bound $b$ on $|f_t|, |y_t|$ for all $t = 1, \ldots, T$ and $\mathbf{f} \in \mathcal{F}$; upper bound $\rho$ on
$\sup_{p, y \in [-b, b]} |\partial_p \ell(p, y)|$; precision parameter $\eta \ge 1/T$
for t = 1 to T do
    $p_t := 0$
    for j = 1 to $\eta T$ do
        For i = t+1, ..., T, let $Y_i$ be a Rademacher random variable
        Draw $\Delta := \inf_{\mathbf{f} \in \mathcal{F}} L(\tilde{\mathbf{f}},\, z_1 \cdots z_{t-1}\, (-1)\, Y_{t+1} \cdots Y_T) - \inf_{\mathbf{f} \in \mathcal{F}} L(\tilde{\mathbf{f}},\, z_1 \cdots z_{t-1}\, 1\, Y_{t+1} \cdots Y_T)$
        Let $p_t := p_t + \frac{b}{2 \eta T} \Delta$
    end for
    Predict $p_t$
    Receive outcome $y_t$ and suffer loss $\ell(p_t, y_t)$
    Let $r_t := \frac{1}{2}\left( 1 - \frac{1}{\rho} \partial_{p_t} \ell(p_t, y_t) \right) \in [0, 1]$
    Let $z_t := 1$ with probability $r_t$, and $z_t := -1$ with probability $1 - r_t$
end for
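The interplay between random playout and randomized rounding can be sketched in a few lines. The code below is a sketch under explicit assumptions rather than the authors' implementation: the class is finite and stored as an `(N, T)` array, `erm_value` stands in for a problem-specific ERM oracle on the scaled experts, the averaged difference of ERM values is rescaled by $b/2$ so that predictions fall in $[-b, b]$, and `loss_subgradient` is a user-supplied (hypothetical) callable returning a subgradient in the first argument.

```python
import numpy as np

def erm_value(scaled_experts, y):
    """inf_f L(f~, y) under absolute loss, for experts already scaled to [-1, +1]."""
    return float(np.abs(scaled_experts - y).sum(axis=1).min())

def r2_forecaster(experts, outcomes, loss_subgradient, b, rho, eta, rng):
    """experts: (N, T) array with entries in [-b, b]; returns the T predictions."""
    scaled = experts / b
    T = experts.shape[1]
    n_draws = max(1, int(round(eta * T)))  # number of random playouts per round
    z = np.empty(T)                        # binary labels produced by randomized rounding
    predictions = np.empty(T)
    for t in range(T):
        p = 0.0
        for _ in range(n_draws):
            playout = rng.choice((-1.0, 1.0), size=T - t - 1)
            minus = erm_value(scaled, np.concatenate([z[:t], [-1.0], playout]))
            plus = erm_value(scaled, np.concatenate([z[:t], [1.0], playout]))
            p += b * (minus - plus) / (2 * n_draws)  # assumed b/2 rescaling, averaged
        predictions[t] = p
        g = loss_subgradient(p, outcomes[t])
        r = 0.5 * (1.0 - g / rho)                    # rounding probability in [0, 1]
        z[t] = 1.0 if rng.random() < r else -1.0
    return predictions

rng = np.random.default_rng(0)
expert = np.array([[1.5, -0.4, 0.0, 2.0]])
y = np.array([2.0, -1.0, 0.5, -2.0])
abs_loss_subgradient = lambda p, y_t: float(np.sign(p - y_t))  # absolute loss, rho = 1
predictions = r2_forecaster(expert, y, abs_loss_subgradient, b=2.0, rho=1.0, eta=0.5, rng=rng)
```

As a sanity check, with a single expert the forecaster reproduces that expert's predictions, independently of the rounded labels. Each round costs `2 * n_draws` ERM calls, which is where the trade-off governed by $\eta$ shows up.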

We end this section with a remark that plays an important role in what follows. 

Remark 1. The predictions of our forecasting strategies do not depend on the ordering of
the predictions of the experts in $\mathcal{F}$. In other words, all the results proven so far also hold in
a setting where the elements of $\mathcal{F}$ are functions $f : \{1, \ldots, T\} \to \mathcal{P}$, and the adversary has
control on the permutation $\pi_1, \ldots, \pi_T$ of $\{1, \ldots, T\}$ that is used to define the prediction $f(\pi_t)$
of expert $f$ at time $t$ (see footnote 2). Also, Thm. 1 implies that the value of
$\mathcal{V}_T^{\mathrm{abs}}(\mathcal{F})$ remains unchanged
irrespective of the permutation chosen by the adversary.



4 Application 1: Transductive Online Learning 

The first application we consider is a rather straightforward one, in the context of
transductive online learning [6]. In this model, we have an arbitrary sequence of labeled examples
$(x_1, y_1), \ldots, (x_T, y_T)$, where only the set $\{x_1, \ldots, x_T\}$ of unlabeled instances is known to the
learner in advance. At each round $t$, the learner must provide a prediction $p_t$ for the label
$y_t$. The true label $y_t$ is then revealed, and the learner incurs a loss $\ell(p_t, y_t)$. The learner's
goal is to minimize the transductive online regret
$\sum_{t=1}^{T} \ell(p_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f(x_t), y_t)$ with
respect to a fixed class of predictors $\mathcal{F}$ of the form $\{x \mapsto f(x)\}$.

The work [16] considers the binary classification case with zero-one loss. Their main
result is that if a class $\mathcal{F}$ of binary functions has bounded VC dimension $d$, and there exists
an efficient algorithm to perform empirical risk minimization, then one can construct an
efficient randomized algorithm for transductive online learning, whose regret is at most
$O(T^{3/4} \sqrt{d \ln(T)})$ in expectation. The significance of this result is that efficient batch
learning (via empirical risk minimization) implies efficient learning in the transductive online
setting. This is an important result, as online learning can be computationally harder than
batch learning (see, e.g., [8] for an example in the context of Boolean learning).

A major open question posed by [16] was whether one can achieve the optimal rate $O(\sqrt{dT})$,
matching the rate of a batch learning algorithm in the statistical setting. Using the $R^2$
Forecaster, we can easily achieve the above result, as well as similar results in a strictly
more general setting. This shows that efficient batch learning not only implies efficient
transductive online learning (the main thesis of [16]), but also that the same rates can be
obtained, and for possibly non-binary prediction problems as well.



2 Formally, at each step $t$: (1) the adversary chooses and reveals the next element $\pi_t$ of the
permutation; (2) the forecaster chooses $p_t \in \mathcal{P}$ and simultaneously the adversary chooses
$y_t \in \mathcal{Y}$.

Theorem 4. Suppose we have a computationally efficient algorithm for empirical risk
minimization (with respect to the zero-one loss) over a class $\mathcal{F}$ of $\{0,1\}$-valued functions with
VC dimension $d$. Then, in the transductive online model, the efficient randomized forecaster
MF* achieves an expected regret of $O(\sqrt{dT})$ with respect to the zero-one loss.
Moreover, for an arbitrary class $\mathcal{F}$ of $[-b, b]$-valued functions with Rademacher complexity
$\mathcal{R}_T(\mathcal{F})$, and any convex $\rho$-Lipschitz loss function, if there exists a computationally efficient
algorithm for empirical risk minimization, then the $R^2$ Forecaster is computationally
efficient and achieves, in the transductive online model, a regret of
$\rho\, \mathcal{R}_T(\mathcal{F}) + O\big(\rho b \sqrt{T \ln(T/\delta)}\big)$
with probability at least $1 - \delta$.

Proof. Since the set $\{x_1, \ldots, x_T\}$ of unlabeled examples is known, we reduce the online
transductive model to prediction with expert advice in the setting of Remark 1. This is
done by mapping each function $f \in \mathcal{F}$ to a function $f : \{1, \ldots, T\} \to \mathcal{P}$ by $t \mapsto f(x_t)$, which
is equivalent to an expert in the setting of Remark 1. When $\mathcal{F}$ maps to $\{0, 1\}$, and we care
about the zero-one loss, we can use the forecaster MF* to compute randomized predictions
and apply Thm. 2 to bound the expected transductive online regret with $\mathcal{R}_T(\mathcal{F})$. For a class
with VC dimension $d$, $\mathcal{R}_T(\mathcal{F}) \le c\sqrt{dT}$ for some constant $c > 0$, using Dudley's chaining
method [12], and this concludes the proof of the first part of the theorem. The second part
is an immediate corollary of Thm. 3. □
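The reduction used in the proof amounts to evaluating every hypothesis on the instances known in advance. The following minimal sketch illustrates this; the threshold class, the instance values, and the helper name `experts_from_hypotheses` are hypothetical and chosen only for illustration.

```python
import numpy as np

def experts_from_hypotheses(hypotheses, instances):
    """Row k is the static expert t -> f_k(x_t) for the known instance sequence."""
    return np.array([[f(x) for x in instances] for f in hypotheses], dtype=float)

# Hypothetical toy class: threshold classifiers on the real line, valued in {0, 1}.
thresholds = [0.0, 0.5, 1.0]
hypotheses = [lambda x, c=c: float(x >= c) for c in thresholds]

instances = [0.2, 0.9, -0.3, 0.6]  # unlabeled instances, known to the learner in advance
experts = experts_from_hypotheses(hypotheses, instances)

# Revealing the instances in a different order only permutes the columns,
# which is the situation covered by Remark 1.
order = [2, 0, 3, 1]
experts_reordered = experts_from_hypotheses(hypotheses, [instances[i] for i in order])
```

The resulting array can be fed to any forecaster for static experts; a $\{0,1\}$-valued class can first be mapped to $\{-1,+1\}$ via $f \mapsto 2f - 1$, mirroring the shift between the two settings used in Appendix A.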

We close this section by contrasting our results for online transductive learning with those
of [7] about standard online learning. If $\mathcal{F}$ contains $\{0, 1\}$-valued functions, then the optimal
regret bound for online learning is of order $\sqrt{d'T}$, where $d'$ is the Littlestone dimension of
$\mathcal{F}$. Since the Littlestone dimension of a class is never smaller than its VC dimension, we
conclude that online learning is a harder setting than online transductive learning.

5 Application 2: Online Collaborative Filtering 

We now turn to discuss the application of our results in the context of collaborative filtering 
with trace-norm constrained matrices, presenting what is (to the best of our knowledge) the 
first computationally efficient online algorithms for this problem. 

In collaborative filtering, the learning problem is to predict entries of an unknown m x n 
matrix based on a subset of its observed entries. A common approach is norm regularization, 
where we seek a low-norm matrix which matches the observed entries as best as possible. 
The norm is often taken to be the trace-norm [231 EH E] ; although other norms have also 
been considered, such as the max-norm [18] and the weighted trace-norm [20, 13].

Previous theoretical treatments of this problem assumed a stochastic setting, where the
observed entries are picked according to some underlying distribution (e.g., [23, 21]). However,
even when the guarantees are distribution-free, assuming a fixed distribution fails to capture
important aspects of collaborative filtering in practice, such as non-stationarity [17]. Thus,
an online adversarial setting, where no distributional assumptions whatsoever are required,
seems to be particularly well-suited to this problem domain.

In an online setting, at each round $t$ the adversary reveals an index pair $(i_t, j_t)$ and secretly
chooses a value $y_t$ for the corresponding matrix entry. After that, the learner selects a
prediction $p_t$ for that entry. Then $y_t$ is revealed and the learner suffers a loss $\ell(p_t, y_t)$.
Hence, the goal of a learner is to minimize the regret with respect to a fixed class $\mathcal{W}$
of prediction matrices,
$\sum_{t=1}^{T} \ell(p_t, y_t) - \inf_{W \in \mathcal{W}} \sum_{t=1}^{T} \ell(W_{i_t, j_t}, y_t)$. As is the case in practice, we
will assume that the adversary picks a different entry in each round. When the learner's
performance is measured by the regret after all $T = mn$ entries have been predicted, the
online collaborative filtering setting reduces to prediction with expert advice as discussed
in Remark 1.

As mentioned previously, $\mathcal{W}$ is often taken to be a convex class of matrices with bounded
trace-norm. Many convex learning problems, such as linear and kernel-based predictors,
as well as matrix-based predictors, can be learned efficiently both in a stochastic and an
online setting, using mirror descent or regularized follow-the-leader methods. However,
for reasonable choices of $\mathcal{W}$, a straightforward application of these techniques can lead
to algorithms with trivial bounds. In particular, in the case of $\mathcal{W}$ consisting of $m \times n$
matrices with trace-norm at most $r$, standard online regret bounds would scale like $O(r\sqrt{T})$.
Since for this norm one typically has $r = O(\sqrt{mn})$, we get a per-round regret guarantee
of $O(\sqrt{mn/T})$. This is a trivial bound, since it becomes "meaningful" (smaller than a
constant) only after all $T = mn$ entries have been predicted.

On the other hand, based on general techniques developed in [TS] and greatly extended in 
PQ, it can be shown that online learnability is information-theoretically possible for such W. 
However, these techniques do not provide a computationally efficient algorithm. Thus, to
the best of our knowledge, there is currently no efficient (polynomial time) online algorithm
that attains non-trivial regret. In this section, we show how to obtain such an algorithm
using the $R^2$ Forecaster.

Consider first the transductive online setting, where the set of indices to be predicted is
known in advance, and the adversary may only choose the order and values of the entries.
It is readily seen that the $R^2$ Forecaster can be applied in this setting, using any convex class
$\mathcal{W}$ of fixed matrices with bounded entries to compete against, and any convex Lipschitz loss
function. To do so, we let $\{(i_k, j_k)\}_{k=1}^{T}$ be the set of entries, and run the $R^2$ Forecaster with
respect to $\mathcal{F} = \{t \mapsto W_{i_t, j_t} : W \in \mathcal{W}\}$, which corresponds to a class of experts as discussed
in Remark 1.

What is perhaps more surprising is that the $R^2$ Forecaster can also be applied in a
non-transductive setting, where the indices to be predicted are not known in advance. Moreover,
the Forecaster doesn't even need to know the horizon $T$ in advance. The key idea to achieve
this is to utilize the non-asymptotic nature of the learning problem, namely that the game
is played over a finite $m \times n$ matrix, so the time horizon is necessarily bounded.

The algorithm we propose is very simple: we apply the $R^2$ Forecaster as if we are in a
setting with time horizon $T = mn$, which is played over all entries of the $m \times n$ matrix. By
Remark 1, the $R^2$ Forecaster does not need to know the order in which these $m \times n$ entries
are going to be revealed. Whenever $\mathcal{W}$ is convex and $\ell$ is a convex function, we can find an
ERM in polynomial time by solving a convex problem. Hence, we can implement the $R^2$
Forecaster efficiently.

To show that this is indeed a viable strategy, we need the following lemma, whose proof is
presented in Appendix C.

Lemma 1. Consider a (possibly randomized) forecaster $A$ for a class $\mathcal{F}$ whose regret after
$T$ steps satisfies $\mathcal{V}_T(A, \mathcal{F}) \le G$ with probability at least $1 - \delta > 1/2$. Furthermore, suppose the
loss function is such that
$\inf_{p' \in \mathcal{P}} \sup_{y \in \mathcal{Y}} \inf_{p \in \mathcal{P}} \big( \ell(p, y) - \ell(p', y) \big) \ge 0$. Then

$$\max_{t = 1, \ldots, T} \mathcal{V}_t(A, \mathcal{F}) \le G \quad \text{with probability at least } 1 - \delta.$$

Note that a simple sufficient condition for the assumption on the loss function to hold is
that $\mathcal{P} = \mathcal{Y}$ and $\ell(p, y) \ge \ell(y, y)$ for all $p, y \in \mathcal{P}$.

Using this lemma, the following theorem exemplifies how we can obtain a regret guarantee 
for our algorithm, in the case of W consisting of the convex set of matrices with bounded 
trace-norm and bounded entries. For the sake of clarity, we will consider n x n matrices. 

Theorem 5. Let $\ell$ be a loss function which satisfies the conditions of Lemma 1. Also, let $\mathcal{W}$
consist of $n \times n$ matrices with trace-norm at most $r = O(n)$ and entries at most $b = O(1)$, and
suppose we apply the $R^2$ Forecaster over time horizon $n^2$ and all entries of the matrix. Then
with probability at least $1 - \delta$, after $T$ rounds, the algorithm achieves an average per-round
regret of at most

$$O\left( \frac{n^{3/2} + n\sqrt{\ln(n/\delta)}}{T} \right) \quad \text{uniformly over } T = 1, \ldots, n^2.$$

Proof. In our setting, where the adversary chooses a different entry at each round, [21,
Theorem 6] implies that for the class $\mathcal{W}'$ of all matrices with trace-norm at most $r = O(n)$,
it holds that $\mathcal{R}_T(\mathcal{W}')/T \le O(n^{3/2}/T)$. Therefore, $\mathcal{R}_{n^2}(\mathcal{W}') \le O(n^{3/2})$. Since $\mathcal{W} \subseteq \mathcal{W}'$,
we get by definition of the Rademacher complexity that $\mathcal{R}_{n^2}(\mathcal{W}) = O(n^{3/2})$ as well. By
Thm. 3, the regret after $n^2$ rounds is $O\big(n^{3/2} + n\sqrt{\ln(n/\delta)}\big)$ with probability at least $1 - \delta$.
Applying Lemma 1, we get that the cumulative regret at the end of any round $T = 1, \ldots, n^2$
is at most $O\big(n^{3/2} + n\sqrt{\ln(n/\delta)}\big)$, as required. □

This bound becomes non-trivial after $n^{3/2}$ entries are revealed, which is still a vanishing
proportion of all $n^2$ entries. While the regret might seem unusual compared to standard
regret bounds (which usually have rates of $1/\sqrt{T}$ for general losses), it is a natural outcome
of the non-asymptotic nature of our setting, where $T$ can never be larger than $n^2$. In fact,
this is the same rate one would obtain in a batch setting, where the entries are drawn from
an arbitrary distribution. Moreover, an assumption such as boundedness of the entries is
required for currently-known guarantees even in a batch setting (see [21] for details).
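To get a feel for the two rates discussed in this section, the sketch below evaluates the per-round bound $\sqrt{mn/T}$ of the standard approach (with $m = n$) next to the per-round bound of Theorem 5. The constants hidden in the $O(\cdot)$ notation are set to 1, which is an assumption made purely for illustration, so only the orders of magnitude are meaningful.

```python
import math

def standard_per_round_bound(n, T):
    """sqrt(n * n / T): the trivial rate of the standard approach, unit constant assumed."""
    return math.sqrt(n * n / T)

def r2_per_round_bound(n, T, delta):
    """(n^{3/2} + n * sqrt(ln(n / delta))) / T, again with unit constants assumed."""
    return (n ** 1.5 + n * math.sqrt(math.log(n / delta))) / T

n, delta = 100, 0.05
for T in (round(n ** 1.5), 10 * round(n ** 1.5), n ** 2):
    print(T, standard_per_round_bound(n, T), r2_per_round_bound(n, T, delta))
```

Under these assumed constants, the first bound stays at or above 1 for every $T \le n^2$, while the second drops below 1 once $T$ exceeds a small multiple of $n^{3/2}$.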
A Derivation of the Minimax Forecaster



In this appendix, we outline how the Minimax Forecaster is derived, as well as its associated
guarantees. This outline closely follows the exposition in [10, Chapter 8], to which we refer
the reader for some of the technical derivations.

First, we note that the Minimax Forecaster as presented in [10] actually refers to a slightly
different setup than ours, where the outcome space is $\mathcal{Y} = \{0, 1\}$ and the prediction space is
$\mathcal{P} = [0, 1]$, rather than $\mathcal{Y} = \{-1,+1\}$ and $\mathcal{P} = [-1,+1]$. We will first derive the forecaster
for the first setting, and then show how to convert it to the second setting.

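Our goal is to find a predictor which minimizes the worst-case regret,

$$\max_{\mathbf{y} \in \{0,1\}^T} \left( L(\mathbf{p}, \mathbf{y}) - \inf_{\mathbf{f} \in \mathcal{F}} L(\mathbf{f}, \mathbf{y}) \right),$$

where $\mathbf{p} = (p_1, \ldots, p_T)$ is the prediction sequence.

For convenience, in the following we sometimes use the notation $\mathbf{y}^t$ to denote a vector
in $\{0, 1\}^t$. The idea of the derivation is to work backwards, starting with computing the
optimal prediction at the last round $T$, then deriving the optimal prediction at round $T - 1$
and so on. In the last round $T$, the first $T - 1$ outcomes $\mathbf{y}^{T-1}$ have been revealed, and we
want to find the optimal prediction $p_T$. Since our goal is to minimize worst-case regret with
respect to the absolute loss, we just need to compute $p_T$ which minimizes

$$\max\left\{ L(\mathbf{p}^{T-1}, \mathbf{y}^{T-1}) + p_T - \inf_{\mathbf{f} \in \mathcal{F}} L(\mathbf{f}, \mathbf{y}^{T-1} 0),\;\; L(\mathbf{p}^{T-1}, \mathbf{y}^{T-1}) + (1 - p_T) - \inf_{\mathbf{f} \in \mathcal{F}} L(\mathbf{f}, \mathbf{y}^{T-1} 1) \right\}.$$

In our setting, it is not hard to show that
$\left| \inf_{\mathbf{f} \in \mathcal{F}} L(\mathbf{f}, \mathbf{y}^{t-1} 0) - \inf_{\mathbf{f} \in \mathcal{F}} L(\mathbf{f}, \mathbf{y}^{t-1} 1) \right| \le 1$ (see
[10, Lemma 8.1]). Using this, we can compute the optimal $p_T$ to be

$$p_T = \frac{1}{2}\left( A_T(\mathbf{y}^{T-1} 1) - A_T(\mathbf{y}^{T-1} 0) + 1 \right) \tag{5}$$

where $A_T(\mathbf{y}^T) = -\inf_{\mathbf{f} \in \mathcal{F}} L(\mathbf{f}, \mathbf{y}^T)$.

Having determined $p_T$, we can continue to the previous prediction $p_{T-1}$. This is equivalent
to minimizing

$$\max\left\{ L(\mathbf{p}^{T-2}, \mathbf{y}^{T-2}) + p_{T-1} + A_{T-1}(\mathbf{y}^{T-2} 0),\;\; L(\mathbf{p}^{T-2}, \mathbf{y}^{T-2}) + (1 - p_{T-1}) + A_{T-1}(\mathbf{y}^{T-2} 1) \right\}$$

where

$$A_{t-1}(\mathbf{y}^{t-1}) = \min_{p_t \in [0,1]} \max\left\{ p_t + A_t(\mathbf{y}^{t-1} 0),\;\; (1 - p_t) + A_t(\mathbf{y}^{t-1} 1) \right\}. \tag{6}$$

Note that by plugging in the value of $p_T$ from Eq. (5), and recalling that
$A_T(\mathbf{y}^T) = -\inf_{\mathbf{f} \in \mathcal{F}} L(\mathbf{f}, \mathbf{y}^T)$, we also get the following equivalent
formulation for $A_{T-1}(\mathbf{y}^{T-1})$:

$$A_{T-1}(\mathbf{y}^{T-1}) = \frac{1}{2}\left( A_T(\mathbf{y}^{T-1} 0) + A_T(\mathbf{y}^{T-1} 1) + 1 \right).$$

Again, it is possible to show that the optimal value of $p_{T-1}$ is

$$p_{T-1} = \frac{1}{2}\left( A_{T-1}(\mathbf{y}^{T-2} 1) - A_{T-1}(\mathbf{y}^{T-2} 0) + 1 \right).$$

Repeating this procedure, one can show that at any round $t$, the minimax optimal prediction
is

$$p_t^* = \frac{1}{2}\left( A_t(\mathbf{y}^{t-1} 1) - A_t(\mathbf{y}^{t-1} 0) + 1 \right) \tag{7}$$

where $A_t$ is defined recursively as $A_T(\mathbf{y}^T) = -\inf_{\mathbf{f} \in \mathcal{F}} L(\mathbf{f}, \mathbf{y}^T)$ and

$$A_{t-1}(\mathbf{y}^{t-1}) = \frac{1}{2}\left( A_t(\mathbf{y}^{t-1} 0) + A_t(\mathbf{y}^{t-1} 1) + 1 \right) \tag{8}$$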
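for all $t$.

At first glance, computing $p_t^*$ from Eq. (7) might seem tricky, since it requires computing
$A_t(\mathbf{y}^t)$, whose recursive expansion in Eq. (8) involves exponentially many terms. Luckily,
the recursive expansion has a simple structure, and it is not hard to show that

$$A_t(\mathbf{y}^t) = \frac{T - t}{2} - \frac{1}{2^{T-t}} \sum_{\mathbf{y}' \in \{0,1\}^{T-t}} \inf_{\mathbf{f} \in \mathcal{F}} L(\mathbf{f}, \mathbf{y}^t \mathbf{y}') = \frac{T - t}{2} - \mathbb{E}\left[ \inf_{\mathbf{f} \in \mathcal{F}} L(\mathbf{f}, \mathbf{y}^t \mathbf{Y}^{T-t}) \right] \tag{9}$$

where $\mathbf{Y}^{T-t}$ is a sequence of $T - t$ i.i.d. Bernoulli random variables, which take values in
$\{0, 1\}$ with equal probability. Plugging this into the formula for the minimax prediction in
Eq. (7), we get that (see footnote 3)

$$p_t^* = \frac{1}{2}\left( \mathbb{E}\left[ \inf_{\mathbf{f} \in \mathcal{F}} L(\mathbf{f}, \mathbf{y}^{t-1} 0\, \mathbf{Y}^{T-t}) \right] - \mathbb{E}\left[ \inf_{\mathbf{f} \in \mathcal{F}} L(\mathbf{f}, \mathbf{y}^{t-1} 1\, \mathbf{Y}^{T-t}) \right] + 1 \right). \tag{10}$$

This prediction rule constitutes the Minimax Forecaster as presented in [10].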
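The passage from the recursion to the closed form can be verified numerically on a small instance. The sketch below works in the $\{0,1\}$ setting of this appendix and is only an illustration: the finite class stored as an `(N, T)` array with entries in $[0,1]$ and all function names are assumptions.

```python
import itertools
import numpy as np

def erm_value(experts, y):
    """inf_f L(f, y) under absolute loss for a finite class given as an (N, T) array."""
    return float(np.abs(experts - np.asarray(y, dtype=float)).sum(axis=1).min())

def A_recursive(experts, prefix):
    """A_t(y^t) via the recursion A_{t-1} = (A_t(. 0) + A_t(. 1) + 1) / 2."""
    if len(prefix) == experts.shape[1]:
        return -erm_value(experts, prefix)
    return 0.5 * (A_recursive(experts, prefix + (0.0,))
                  + A_recursive(experts, prefix + (1.0,)) + 1.0)

def A_explicit(experts, prefix):
    """A_t(y^t) = (T - t) / 2 - E[inf_f L(f, y^t Y^{T-t})], expectation by enumeration."""
    k = experts.shape[1] - len(prefix)
    values = [erm_value(experts, prefix + future)
              for future in itertools.product((0.0, 1.0), repeat=k)]
    return k / 2 - float(np.mean(values))

def minimax_prediction(experts, prefix):
    """Prediction for the round after `prefix`, from the difference of the two A values."""
    return 0.5 * (A_explicit(experts, prefix + (1.0,))
                  - A_explicit(experts, prefix + (0.0,)) + 1.0)

rng = np.random.default_rng(0)
experts = rng.uniform(0.0, 1.0, size=(3, 4))  # 3 experts with values in [0, 1], T = 4
gap = max(abs(A_recursive(experts, p) - A_explicit(experts, p))
          for t in range(5) for p in itertools.product((0.0, 1.0), repeat=t))
```

The quantity `gap` should vanish up to floating-point error, and the resulting predictions should stay within $[0,1]$, in line with the bound from [10, Lemma 8.1] quoted above.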

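After deriving the algorithm, we turn to analyze its regret performance. To do so, we just
need to note that $A_0$ equals the worst-case regret (see the recursive definition at Eq. (6)).
Using the alternative explicit definition in Eq. (9), we get that the worst-case regret equals

$$\frac{T}{2} - \mathbb{E}\left[ \inf_{\mathbf{f} \in \mathcal{F}} \sum_{t=1}^{T} |f_t - Y_t| \right] = \mathbb{E}\left[ \sup_{\mathbf{f} \in \mathcal{F}} \sum_{t=1}^{T} \left( \frac{1}{2} - |f_t - Y_t| \right) \right] = \mathbb{E}\left[ \sup_{\mathbf{f} \in \mathcal{F}} \sum_{t=1}^{T} \sigma_t \left( f_t - \frac{1}{2} \right) \right],$$

where $\sigma_t$ are i.i.d. Rademacher random variables (taking values of $-1$ and $+1$ with equal
probability). Recalling the definition of Rademacher complexity, Eq. (2), we get that the
regret is bounded by the Rademacher complexity of the shifted class, which is obtained from
$\mathcal{F}$ by taking every $\mathbf{f} \in \mathcal{F}$ and replacing every coordinate $f_t$ by $f_t - 1/2$.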

Finally, it remains to show how to convert the forecaster and analysis above to the setting
discussed in this paper, where the outcomes are in $\{-1,+1\}$ rather than $\{0,1\}$ and the
predictions are in $[-1,+1]$ rather than $[0,1]$. To do so, consider a learning problem in
this new setting, with some class $\mathcal{F}$. For any vector $y$, define $\tilde{y}$ to be the shifted vector
$(y + \mathbf{1})/2$, where $\mathbf{1} = (1, \ldots, 1)$ is the all-ones vector. Also, define $\tilde{\mathcal{F}}$ to be the shifted class
$\tilde{\mathcal{F}} = \{(f + \mathbf{1})/2 : f \in \mathcal{F}\}$. It is easily seen that $L(f, y) = 2L(\tilde{f}, \tilde{y})$ for any $f, y$, where
$\tilde{f} = (f + \mathbf{1})/2$. As a result, if we look at the prediction $p_t$ given by our forecaster in Eq. (3),
then $\tilde{p}_t = (p_t + 1)/2$ is the minimax optimal prediction given by Eq. (10) with respect to the
class $\tilde{\mathcal{F}}$ and the outcomes $\tilde{y}$. So our analysis above applies, and we get that



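\[
\max_{y\in\{-1,+1\}^T} \Big( L(p, y) - \inf_{f\in\mathcal{F}} L(f, y) \Big)
= \max_{\tilde{y}\in\{0,1\}^T} 2\Big( L(\tilde{p}, \tilde{y}) - \inf_{\tilde{f}\in\tilde{\mathcal{F}}} L(\tilde{f}, \tilde{y}) \Big)
= 2\,\mathbb{E}\Big[ \sup_{\tilde{f}\in\tilde{\mathcal{F}}} \sum_{t=1}^{T} \sigma_t \Big( \tilde{f}_t - \frac{1}{2} \Big) \Big]
= \mathbb{E}\Big[ \sup_{f\in\mathcal{F}} \sum_{t=1}^{T} \sigma_t f_t \Big],
\]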



which is exactly the Rademacher complexity of the class $\mathcal{F}$.
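
The two rescaling facts used in this conversion are easy to verify numerically. The sketch below assumes the absolute loss and a finite class of vectors in [-1, +1]^T; the helper names (`shift`, `rademacher`) are hypothetical. It checks that the loss doubles when moving from shifted to original coordinates, and that twice the Rademacher complexity of the shifted class (with the 1/2 offset) equals the Rademacher complexity of the original class.

```python
from itertools import product


def abs_loss(f, y):
    """Cumulative absolute loss L(f, y) = sum_t |f_t - y_t|."""
    return sum(abs(a - b) for a, b in zip(f, y))


def shift(v):
    """Map a vector with coordinates in [-1, +1] to [0, 1] via (v + 1) / 2."""
    return tuple((x + 1) / 2 for x in v)


def rademacher(F, offset=0.0):
    """E sup_f sum_t sigma_t (f_t - offset), enumerating all 2^T sign vectors."""
    T = len(F[0])
    signs = list(product((-1, 1), repeat=T))
    total = sum(
        max(sum(s * (ft - offset) for ft, s in zip(f, sig)) for f in F)
        for sig in signs
    )
    return total / len(signs)
```

Since each shifted coordinate minus 1/2 is exactly half of the original coordinate, the second identity holds term by term inside the supremum.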



B Proof of Thm. 2



Let $Y(t)$ denote the set of Bernoulli random variables chosen at round $t$. Let $\mathbb{E}_{z_t}$ denote
expectation with respect to $z_t$, conditioned on $z_1, Y(1), \ldots, z_{t-1}, Y(t-1)$ as well as $Y(t)$.
Let $\mathbb{E}_{Y(t)}$ denote the expectation with respect to the random drawing of $Y(t)$, conditioned
on $z_1, Y(1), \ldots, z_{t-1}, Y(t-1)$.

Footnote 3: This fact appears in an implicit form in [S]; see also [10, Exercise 8.4].

We will need two simple observations. First, by convexity of the loss function, we have that
for any $p_t, f_t, y_t$, $\ell(p_t, y_t) - \ell(f_t, y_t) \le (p_t - f_t)\,\partial_{p_t}\ell(p_t, y_t)$. Second, by definition of $r_t$ and
$z_t$, we have that for any fixed $p_t, f_t$,

\[
\begin{aligned}
\frac{1}{\rho b}(p_t - f_t)\,\partial_{p_t}\ell(p_t, y_t)
&= \frac{1}{b}(p_t - f_t)(1 - 2r_t) \\
&= \frac{1}{b}\, r_t (f_t - p_t) + \frac{1}{b}(1 - r_t)(p_t - f_t) \\
&= r_t(\tilde{f}_t - \tilde{p}_t) + (1 - r_t)(\tilde{p}_t - \tilde{f}_t) \\
&= r_t\big((1 - \tilde{p}_t) - (1 - \tilde{f}_t)\big) + (1 - r_t)\big((\tilde{p}_t + 1) - (\tilde{f}_t + 1)\big) \\
&= \mathbb{E}_{z_t}\big[\, |\tilde{p}_t - z_t| - |\tilde{f}_t - z_t| \,\big].
\end{aligned}
\]

The last transition uses the fact that $\tilde{p}_t, \tilde{f}_t \in [-1, +1]$. By these two observations, we have

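\[
\sum_{t=1}^{T} \ell(p_t, y_t) - L(f, y)
\le \sum_{t=1}^{T} (p_t - f_t)\,\partial_{p_t}\ell(p_t, y_t)
= \rho b \sum_{t=1}^{T} \mathbb{E}_{z_t}\big[\, |\tilde{p}_t - z_t| - |\tilde{f}_t - z_t| \,\big].
\tag{11}
\]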
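
The second observation can be sanity-checked numerically. The sketch below assumes, as the chain of equalities indicates, that $z_t$ equals $+1$ with probability $r_t$ and $-1$ otherwise, and that the scaled quantities are $p_t/b$ and $f_t/b$. The function name `rounding_identity_gap` is hypothetical. It returns the difference between the second line of the chain and its last line, which should vanish whenever $p_t, f_t \in [-b, b]$.

```python
def rounding_identity_gap(p, f, r, b):
    """Gap between (1/b)(p - f)(1 - 2r) and E_z[|p/b - z| - |f/b - z|].

    Assumes z = +1 with probability r and z = -1 with probability 1 - r.
    The gap is zero when p and f both lie in [-b, b].
    """
    p_s, f_s = p / b, f / b  # scaled prediction and comparator
    lhs = (p - f) * (1 - 2 * r) / b
    rhs = (
        r * (abs(p_s - 1) - abs(f_s - 1))
        + (1 - r) * (abs(p_s + 1) - abs(f_s + 1))
    )
    return lhs - rhs
```

Outside the range the identity can fail (for example with p = 3, f = 0, r = 1, b = 1 the gap is -4), which is why the last transition needs the scaled values to lie in $[-1, +1]$.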

Now, note that $|\tilde{p}_t - z_t| - |\tilde{f}_t - z_t| - \mathbb{E}_{z_t}[|\tilde{p}_t - z_t| - |\tilde{f}_t - z_t|]$ for $t = 1, \ldots, T$ is a martingale
difference sequence: for any values of $z_1, Y(1), \ldots, z_{t-1}, Y(t-1), Y(t)$ (which fixes $\tilde{p}_t$), the
conditional expectation of this expression over $z_t$ is zero. Using Azuma's inequality, we can
upper bound Eq. (11) with probability at least $1 - \delta/2$ by

\[
\rho b \sum_{t=1}^{T} \big( |\tilde{p}_t - z_t| - |\tilde{f}_t - z_t| \big) + \rho b \sqrt{8T\ln(2/\delta)}.
\tag{12}
\]
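
One way to trace the constant under the square root is sketched below. It assumes the bounded-range form of the Hoeffding-Azuma inequality: for martingale differences $V_t$ confined, given the past, to an interval of length $c_t$, $\Pr(\sum_t V_t > \epsilon) \le \exp(-2\epsilon^2 / \sum_t c_t^2)$. The symbols $V_t$, $c_t$ and $\epsilon$ are introduced here for illustration only.

```latex
% For z_t in {-1,+1} and \tilde{p}_t, \tilde{f}_t in [-1,+1] we have |\tilde{p}_t - z_t| = 1 - \tilde{p}_t z_t, hence
\[
V_t := \mathbb{E}_{z_t}\big[|\tilde{p}_t - z_t| - |\tilde{f}_t - z_t|\big] - \big(|\tilde{p}_t - z_t| - |\tilde{f}_t - z_t|\big)
     = (\tilde{f}_t - \tilde{p}_t)\big(\mathbb{E}_{z_t}[z_t] - z_t\big).
\]
% Given the past, V_t takes two values (z_t = +1 or z_t = -1) that differ by
% 2|\tilde{f}_t - \tilde{p}_t| \le 4, so c_t = 4 and
\[
\Pr\Big(\sum_{t=1}^{T} V_t > \epsilon\Big) \le \exp\Big(-\frac{2\epsilon^2}{16T}\Big) = \exp\Big(-\frac{\epsilon^2}{8T}\Big).
\]
% Setting the right-hand side to \delta/2 and solving for \epsilon gives
\[
\epsilon = \sqrt{8T\ln(2/\delta)},
\]
% and multiplying by \rho b yields the additive term in Eq. (12).
```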

The next step is to relate Eq. (12) to $\rho b \sum_{t=1}^{T} (|\mathbb{E}_{Y(t)}[\tilde{p}_t] - z_t| - |\tilde{f}_t - z_t|)$. It might be
tempting to appeal to Azuma's inequality again. Unfortunately, there is no martingale
difference sequence here, since $z_t$ is itself a random variable whose distribution is influenced
by $Y(t)$. Thus, we need to turn to coarser methods. By the triangle inequality, Eq. (12) can be
upper bounded by

\[
\rho b \sum_{t=1}^{T} \big( |\mathbb{E}_{Y(t)}[\tilde{p}_t] - z_t| - |\tilde{f}_t - z_t| \big)
+ \rho b \sum_{t=1}^{T} \big| \tilde{p}_t - \mathbb{E}_{Y(t)}[\tilde{p}_t] \big|
+ \rho b \sqrt{8T\ln(2/\delta)}.
\tag{13}
\]
Recall that $\tilde{p}_t$ is an average over $\eta T$ i.i.d. random variables, with expectation $\mathbb{E}_{Y(t)}[\tilde{p}_t]$.
By Hoeffding's inequality, this implies that for any $t = 1, \ldots, T$, with probability at least
$1 - \delta/(2T)$ over the choice of $Y(t)$, $|\tilde{p}_t - \mathbb{E}_{Y(t)}[\tilde{p}_t]| \le \sqrt{2\ln(2T/\delta)/(\eta T)}$. By a union bound,
it follows that with probability at least $1 - \delta/2$ over the choice of $Y(1), \ldots, Y(T)$,

\[
\sum_{t=1}^{T} \big| \tilde{p}_t - \mathbb{E}_{Y(t)}[\tilde{p}_t] \big| \le \sqrt{\frac{2T\ln(2T/\delta)}{\eta}}.
\]
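
The union-bound step is plain arithmetic, spelled out here as a worked equation. The shorthand $\epsilon_t$ for the per-round deviation bound is introduced only for this illustration.

```latex
\[
\epsilon_t := \sqrt{\frac{2\ln(2T/\delta)}{\eta T}}, \qquad
\Pr\Big( \exists\, t \le T :\ \big|\tilde{p}_t - \mathbb{E}_{Y(t)}[\tilde{p}_t]\big| > \epsilon_t \Big)
\le \sum_{t=1}^{T} \frac{\delta}{2T} = \frac{\delta}{2},
\]
\[
\sum_{t=1}^{T} \epsilon_t = T \sqrt{\frac{2\ln(2T/\delta)}{\eta T}} = \sqrt{\frac{2T\ln(2T/\delta)}{\eta}}.
\]
```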



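Combining this with Eq. (13), we get that with probability at least $1 - \delta$, Eq. (11) is upper bounded by

\[
\rho b \sum_{t=1}^{T} \big( |\mathbb{E}_{Y(t)}[\tilde{p}_t] - z_t| - |\tilde{f}_t - z_t| \big)
+ \rho b \sqrt{\frac{2T\ln(2T/\delta)}{\eta}}
+ \rho b \sqrt{8T\ln(2/\delta)}.
\tag{14}
\]

Finally, by definition of $\tilde{p}_t = p_t/b$, we have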



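\[
\mathbb{E}_{Y(t)}[\tilde{p}_t] = \mathbb{E}_{Y(t)}\Big[ \inf_{f\in\mathcal{F}} L(\tilde{f}, z_1 \cdots z_{t-1}\,(-1)\,Y_{t+1} \cdots Y_T) - \inf_{f\in\mathcal{F}} L(\tilde{f}, z_1 \cdots z_{t-1}\,1\,Y_{t+1} \cdots Y_T) \Big].
\]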



This is exactly the Minimax Forecaster's prediction at round $t$, with respect to the sequence
of outcomes $z_1, \ldots, z_{t-1} \in \{-1, +1\}$, and the class $\tilde{\mathcal{F}} := \{\tilde{f} : f \in \mathcal{F}\} \subseteq [-1, 1]^T$. Therefore,
using Thm. 1, we can upper bound Eq. (14) by

\[
\rho b\,\mathcal{R}_T(\tilde{\mathcal{F}}) + \rho b \sqrt{\frac{2T\ln(2T/\delta)}{\eta}} + \rho b \sqrt{8T\ln(2/\delta)}.
\]



By definition of $\tilde{\mathcal{F}}$ and Rademacher complexity, it is straightforward to verify that
$\mathcal{R}_T(\tilde{\mathcal{F}}) = \frac{1}{b}\mathcal{R}_T(\mathcal{F})$. Using that to rewrite the bound, and slightly simplifying for readability, the result
stated in the theorem follows.
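
To get a feel for how the three terms trade off, the following sketch evaluates the unsimplified bound from the last display, after substituting $\mathcal{R}_T(\tilde{\mathcal{F}}) = \mathcal{R}_T(\mathcal{F})/b$. The function name and argument names are hypothetical; the Rademacher complexity of the class is taken as a given number, and this is not the simplified expression stated in the theorem.

```python
import math


def r2_regret_bound(rademacher_F, T, rho, b, eta, delta):
    """Evaluate rho*R_T(F) + rho*b*sqrt(2T ln(2T/delta)/eta) + rho*b*sqrt(8T ln(2/delta)).

    rademacher_F : value of R_T(F), supplied by the caller
    T            : number of rounds
    rho, b       : the constants rho and b appearing in the proof
    eta          : sampling parameter (each prediction averages eta*T draws)
    delta        : failure probability, 0 < delta < 1
    """
    sampling_term = rho * b * math.sqrt(2 * T * math.log(2 * T / delta) / eta)
    rounding_term = rho * b * math.sqrt(8 * T * math.log(2 / delta))
    return rho * rademacher_F + sampling_term + rounding_term
```

Increasing `eta` shrinks only the middle term, which reflects the trade-off between computation (more random playouts per round) and regret.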
C Proof of Lemma 1



The proof assumes that the infimum and supremum of certain functions over $\mathcal{Y}$, $\mathcal{F}$ are
attainable. If not, the proof can be easily adapted by finding attainable values which are
$\epsilon$-close to the infimum or supremum, and then taking $\epsilon \to 0$.

For the purpose of contradiction, suppose there exists a strategy for the adversary and a
round $r \le T$ such that at the end of round $r$, the forecaster suffers a regret $G' > G$ with
probability larger than $\delta$. Consider the following modified strategy for the adversary: the
adversary plays according to the aforementioned strategy until round $r$. It then computes

\[
f^* = \operatorname*{argmin}_{f\in\mathcal{F}} \sum_{t=1}^{r} \ell(f_t, y_t).
\]

At all subsequent rounds $t = r+1, r+2, \ldots, T$, the adversary chooses

\[
y_t^* = \operatorname*{argmax}_{y\in\mathcal{Y}} \inf_{p\in\mathcal{P}} \big( \ell(p, y) - \ell(f_t^*, y) \big).
\]

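By the assumption on the loss function,

\[
\ell(p_t, y_t^*) - \ell(f_t^*, y_t^*)
\ge \inf_{p\in\mathcal{P}} \big( \ell(p, y_t^*) - \ell(f_t^*, y_t^*) \big)
= \sup_{y\in\mathcal{Y}} \inf_{p\in\mathcal{P}} \big( \ell(p, y) - \ell(f_t^*, y) \big) \ge 0.
\]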

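Thus, the regret over all $T$ rounds, with respect to $f^*$, is

\[
\sum_{t=1}^{r} \big( \ell(p_t, y_t) - \ell(f_t^*, y_t) \big) + \sum_{t=r+1}^{T} \big( \ell(p_t, y_t^*) - \ell(f_t^*, y_t^*) \big)
\ge \sum_{t=1}^{r} \ell(p_t, y_t) - \sum_{t=1}^{r} \ell(f_t^*, y_t),
\]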

which is at least $G'$ with probability larger than $\delta$. On the other hand, we know that
the learner's regret is at most $G$ with probability at least $1 - \delta$. Thus we have a
contradiction and the proof is concluded.